Case Study 05: Analysis of failures and rekomendation for delivery of repair parts.

Course: IDA (Introduction to Data Analytics with R) - TU Belrin

Group 05: Aymen Baklouti, Dimitri Litvinenko, Ichrak Allagoui, Manuel Otero, Qing Huang

SS 2022, 16.09.2022

Main briefing

You work in a chain of auto repair shops in Bavaria. You operate workshops in the cities of Augsburg, Ingolstadt, Regensburg, Würzburg, Bamberg, Bayreuth, Aschaffenburg, Erlangen, Rosenheim and Landshut. Analyze the data from the last three years (2014 - 2016 inclusive). Based on your analysis, make a recommendation to each repair shop on how much of which part should be delivered to each repair shop.

Furthermore: - A vehicles with defects were repaired in the city where they were registered - A vehicle is always considered to be defective if an installed individual part, an installed component or the entire vehicle is marked as defective - Data sets with vehicles, components and parts information, also registration data and geodata, are stored in the given data-base order. - The categories of data sets are listed below: Single Part = Einzelteil Component = Komponente Vehicle = Fahrzeug Geodata = Geodaten Registrations = Zulassungen Logistics delay = Logistikverzug

Procedure of the case study

Install required Packages

According to used commands, install and load necessary packages.

if(!require(rmdformats)){
  install.packages("rmdformats")
}
library(rmdformats)

if(!require(prettydoc)){
  install.packages("prettydoc")
}
library(prettydoc)

if(!require(kableExtra)){
  install.packages("kableExtra")
}
library(kableExtra)

if (requireNamespace("thematic")) 
  thematic::thematic_rmd(font = "auto")   

if(!require(install.load)){
  install.packages("install.load")
}
library(install.load)

if(!require(tidyverse)){
  install.packages("tidyverse")
}
library(tidyverse)

if(!require(knitr)){
  install.packages("knitr")
}
library(knitr)

if(!require(reader)){
  install.packages("reader")
}
library(reader)

if (!require(DT)) {
  install.packages("DT")
}
library(DT)

if (!require(plotly)) {
  install.packages("plotly")
}
library(plotly)

if (!require(scales)) {
  install.packages("scales")
}
library(scales)

if (!require(stringr)) {
  install.packages("stringr")
}
library(stringr)

if (!require(ggplot2)) {
  install.packages("ggplot2")
}
library(ggplot2)

if (!require(RJSONIO)) {
  install.packages("RJSONIO")
}
library(RJSONIO)

if (!require(dplyr)) {
  install.packages("dplyr")
}
library(dplyr)

if (!require(leaflet)) {
  install.packages("leaflet")
}
library(leaflet)

if (!require(leaflet.extras)) {
  install.packages("leaflet.extras")
}
library(leaflet.extras)

if (!require(lubridate)) {
  install.packages("lubridate")
}
library(lubridate)

if (!require(forecast)) {
  install.packages("forecast")
}
library(forecast)

if (!require(padr)) {
  install.packages("padr")
}
library(padr)

if (!require(tidyquant)) {
  install.packages("tidyquant")
}
library(tidyquant)

if (!require(timetk)) {
  install.packages("timetk")
}
library(timetk)

if (!require(data.table)) {
  install.packages("data.table")
  require(data.table)
}
library(data.table)

Import

Registrations directory

We set a list with cities we operate in as repair shop owner. During the procedure we switch between the directory’s. For that reason we keep each path as character string.

Cities <- sort(c("AUGSBURG", "INGOLSTADT", "REGENSBURG", "WUERZBURG", "BAMBERG", 
            "BAYREUTH", "ASCHAFFENBURG", "ERLANGEN", "ROSENHEIM", "LANDSHUT"))

Imported_Files      <- c()
Current_Path        <- getwd()

Path_Parts          <- "./Data/Einzelteil/"
Path_Groups         <- "./Data/Komponente/"
Path_Vehicle        <- "./Data/Fahrzeug/"
Path_Registrations  <- "./Data/Zulassungen/"
Path_Geodata        <- "./Data/Geodaten/"
Path_LogisticsDelay <- "./Data/Logistikverzug/"

Geodata are used to show the localization of the repair shops on the map.

setwd(Path_Geodata)
Geodata <- read_csv2("Geodaten_Gemeinden_v1.2_2017-08-22_TrR.csv", col_types = "dddccc")
Geodata <- dplyr::select(Geodata, -1, -2)

head(Geodata)
>  # A tibble: 6 × 4
>    Postleitzahl Gemeinde          Laengengrad Breitengrad
>           <dbl> <chr>             <chr>       <chr>      
>  1         1067 DRESDEN           13,736883   51,051697  
>  2         1445 RADEBEUL          13,659342   51,105975  
>  3         1454 RADEBERG          13,921969   51,116404  
>  4         1454 WACHAU            13,904681   51,160116  
>  5         1458 OTTENDORF OKRILLA 13,831724   51,187617  
>  6         1468 MORITZBURG        13,68009    51,159495

Import data from the Registrations directory. For later analysis create a set of registration-ID´s, from cities we have an repairshop in. We also checked all locations for failures like case inconsistency or grammar failures, after grouping them by “Gemeinden”.

# get into targetpath:
setwd(Path_Registrations)
Imported_IDs              <- c("Registrations_Data")
Imported_Files            <- append(Imported_Files, c(Imported_IDs))
# Make a list of main groups of data-types:
Files_txt                 <- list.files(pattern = ("*.txt"))
Files_csv                 <- list.files(pattern = ("*.csv"))

Registrations_Data        <- read_csv2(Files_csv, col_types = "dccD")

# Check the correctness of the names manually:
Cities_Summary            <- dplyr::count(Registrations_Data, Gemeinden)

# Make a summary in each interested city for better understanding of the data:
Cars_Registered_Grouped   <- group_by(Registrations_Data, Gemeinden)

# Save interested ID´s for further exploration:
# IDNumbers_Vector           <- sort(pull(Cars_Registered_Grouped, IDNummer))   
IDNumbers_Vector           <- Cars_Registered_Grouped %>% 
                          filter(Gemeinden %in% Cities) %>% 
                          pull(IDNummer)

# How much registrations in each of our cities exists:
Cars_Registered_Summary   <- Cars_Registered_Grouped %>% 
                          filter(Gemeinden %in% Cities) %>% 
                          count("Gemeinden")
print(Cars_Registered_Summary)
>  # A tibble: 10 × 3
>  # Groups:   Gemeinden [10]
>     Gemeinden     `"Gemeinden"`     n
>     <chr>         <chr>         <int>
>   1 ASCHAFFENBURG Gemeinden      7032
>   2 AUGSBURG      Gemeinden     23532
>   3 BAMBERG       Gemeinden      6692
>   4 BAYREUTH      Gemeinden      6672
>   5 ERLANGEN      Gemeinden      9882
>   6 INGOLSTADT    Gemeinden     12216
>   7 LANDSHUT      Gemeinden      6410
>   8 REGENSBURG    Gemeinden     12190
>   9 ROSENHEIM     Gemeinden      6030
>  10 WUERZBURG     Gemeinden     10878

Vehicle directory

This Path had 8 csv-files - 4 with relations tables and 4 sets with data about specific vehicles. All of them has csv-type could be imported with read_delim().

# get into target-path:
setwd(Path_Vehicle)
Imported_Vehicles <- c("Relations_Data_11", "Relations_Data_12", "Relations_Data_21", "Relations_Data_22",
                       "Vehicle_Data_11", "Vehicle_Data_12", "Vehicle_Data_21", "Vehicle_Data_22")
#Imported_Files <- append(Imported_Files, Imported_Vehicles)

# Make a list of main groups of data-types:
Files_csv                 <- list.files(pattern = ("*.csv"))

Relations_Data_11         <- read_delim(Files_csv[1], col_types = "dccccc")

# Find out variable types
spec(Relations_Data_11)

cols( …1 = col_double(), ID_Karosserie = col_character(), ID_Schaltung = col_character(), ID_Sitze = col_character(), ID_Motor = col_character(), ID_Fahrzeug = col_character() )

Relations_Data_12         <- read_delim(Files_csv[2], col_types = "dccccc")
Relations_Data_21         <- read_delim(Files_csv[3], col_types = "dccccc")
Relations_Data_22         <- read_delim(Files_csv[4], col_types = "dccccc")

Vehicle_Data_11           <- read_delim(Files_csv[5], col_types = "ddcDdddDd")
Vehicle_Data_12           <- read_delim(Files_csv[6], col_types = "ddcDdddDd")
Vehicle_Data_21           <- read_delim(Files_csv[7], col_types = "ddcdddDddc")
Vehicle_Data_22           <- read_delim(Files_csv[8], col_types = "ddcdddDddc")

# Filter on this point to thin out the data.
filter_by_ID <- function(data, ID) {
  data <- data %>% filter(ID_Fahrzeug %in% ID)
  return(data)
}

# Make sure the Date Variable is set as "Date-Format"
Vehicle_Data_11$Fehlerhaft_Datum <- as.Date(Vehicle_Data_11$Fehlerhaft_Datum, 
                                            tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
Vehicle_Data_12$Fehlerhaft_Datum <- as.Date(Vehicle_Data_12$Fehlerhaft_Datum, 
                                            tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
Vehicle_Data_21$Fehlerhaft_Datum <- as.Date(Vehicle_Data_21$Fehlerhaft_Datum, 
                                            tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
Vehicle_Data_22$Fehlerhaft_Datum <- Vehicle_Data_22$Fehlerhaft_Datum %>% 
                                    as.Date(tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))

Components directory

Importing components: Data stored as string of characters in a txt-File had different delimiters for rows and columns. We opened each to determine the delimiters. The Procedure of importing data from txt is read the content as a string and replace inconvenient separators with ones can be detected by the fread()-function. Than we have a format of string-data can be stored in a usual table or a tibble.

# get into targetpath:
setwd(Path_Groups)
Imported_Groups <- sort(c("Komponente_K1DI2", "Komponente_K2LE1", 
                         "Komponente_K2LE2", "Komponente_K2ST1", 
                         "Komponente_K3AG2", "Komponente_K7", 
                         "Komponente_K1BE1", "Komponente_K1BE2", 
                         "Komponente_K1DI1", "Komponente_K2ST2", 
                         "Komponente_K3AG1", "Komponente_K3SG1", 
                         "Komponente_K3SG2", "Komponente_K4", 
                         "Komponente_K5", "Komponente_K6",
                         "Bestandteile_Komponente_K1DI2", 
                         "Bestandteile_Komponente_K2LE1", 
                         "Bestandteile_Komponente_K2LE2", 
                         "Bestandteile_Komponente_K2ST1", 
                         "Bestandteile_Komponente_K3AG2", 
                         "Bestandteile_Komponente_K7", 
                         "Bestandteile_Komponente_K1BE1", 
                         "Bestandteile_Komponente_K1BE2", 
                         "Bestandteile_Komponente_K1DI1", 
                         "Bestandteile_Komponente_K2ST2", 
                         "Bestandteile_Komponente_K3AG1", 
                         "Bestandteile_Komponente_K3SG1", 
                         "Bestandteile_Komponente_K3SG2", 
                         "Bestandteile_Komponente_K4", 
                         "Bestandteile_Komponente_K5", 
                         "Bestandteile_Komponente_K6"))

Imported_Files <- append(Imported_Files, Imported_Groups)

# Make a list of main groups of data-types:
Files_txt                 <- list.files(pattern = ("*.txt"))
Files_csv                 <- list.files(pattern = ("*.csv"))

Sep_txt                   <- c(c("K1DI2", "\\\\", "\\t"), 
                          c("K2LE1", "II",   ""), 
                          c("K2LE2", "\\\\", "\\n"), 
                          c("K2ST1", "\\|",  "\\n"),
                          c("K3AG2", "\\\\", "\\n"),
                          c("K7",    "\\t",  "\\n"))

# import txt-files with VarNames as DataNames
Data_Name                 <- sub("*.txt", "", Files_txt[1])
Data_Group                <- read_file(Files_txt[1]) %>% 
                          gsub(pattern = "\\t", replacement = "\n") %>%
                          gsub(pattern = "\\\\", replacement = ";") %>%
                          fread()
do.call("<-", list(Data_Name, Data_Group)) 

Data_Name                 <- sub("*.txt", "", Files_txt[2])
Data_Group                <- read_file(Files_txt[2]) %>%
                          gsub(pattern = "", replacement = "\n") %>%
                          gsub(pattern = "II", replacement = ";") %>%
                          fread()
do.call("<-", list(Data_Name, Data_Group))

Data_Name                 <- sub("*.txt", "", Files_txt[3])
Data_Group                <- read_file(Files_txt[3]) %>% 
                          gsub(pattern = "\\n", replacement = "\n") %>%
                          gsub(pattern = "\\\\", replacement = ";") %>%
                          fread()
do.call("<-", list(Data_Name, Data_Group)) 

Data_Name                 <- sub("*.txt", "", Files_txt[4])
Data_Group                <- read_file(Files_txt[4]) %>% 
                          gsub(pattern = "\\n", replacement = "\n") %>%
                          gsub(pattern = "\\|", replacement = ";") %>%
                          fread()
do.call("<-", list(Data_Name, Data_Group)) 

Data_Name                 <- sub("*.txt", "", Files_txt[5])
Data_Group                <- read_file(Files_txt[5]) %>% 
                          gsub(pattern = "\\n", replacement = "\n") %>%
                          gsub(pattern = "\\\\", replacement = ";") %>%
                          fread()
do.call("<-", list(Data_Name, Data_Group)) 

Data_Name                 <- sub("*.txt", "", Files_txt[6])
Data_Group                <- read_file(Files_txt[6]) %>% 
                          gsub(pattern = "\\n", replacement = "\n") %>%
                          gsub(pattern = "\\t", replacement = ";") %>%
                          fread()
do.call("<-", list(Data_Name, Data_Group)) 

rm(Data_Group)

# Import all csv-Files from current path:
for (i in 1:length(Files_csv)) {
  Data_Name <- sub("*.csv", "", Files_csv[i])
  do.call("<-", list(Data_Name, read_delim(Files_csv[i], 
                                           show_col_types = FALSE)))
}

Importing Parts_Data: The parts Files contains of *.csv’ with “;” and “,” as separator. And *.txt with random separators. File T22 and T35 has addition difficulties to import, because of Row Separator is an empty char “” and the possibility that alphanumeric characters precede and follows a double quoted value is possible in different cases.

# Read Parts-Files from destination-directory
setwd(Path_Parts)
Files_txt       <- list.files(pattern = ("*.txt"))
Files_csv       <- list.files(pattern = ("*.csv"))
Imported_Parts  <- c("Einzelteil_T01", "Einzelteil_T02", "Einzelteil_T03", 
                    "Einzelteil_T04", "Einzelteil_T05", "Einzelteil_T06", 
                    "Einzelteil_T07", "Einzelteil_T08", "Einzelteil_T09", 
                    "Einzelteil_T10", "Einzelteil_T11", "Einzelteil_T12", 
                    "Einzelteil_T13", "Einzelteil_T14", "Einzelteil_T15", 
                    "Einzelteil_T16", "Einzelteil_T17", "Einzelteil_T18", 
                    "Einzelteil_T19", "Einzelteil_T20", "Einzelteil_T21", 
                    "Einzelteil_T22", "Einzelteil_T23", "Einzelteil_T24", 
                    "Einzelteil_T25", "Einzelteil_T26", "Einzelteil_T27", 
                    "Einzelteil_T30", "Einzelteil_T31", "Einzelteil_T32",
                    "Einzelteil_T33", "Einzelteil_T34", "Einzelteil_T35", 
                    "Einzelteil_T36", "Einzelteil_T37", "Einzelteil_T38", 
                    "Einzelteil_T39", "Einzelteil_T40")

# *.txt File als einen Character_string lesen
Einzelteil_T01 <- read_file(Files_txt[1]) %>% 
                  gsub(rep = ";", pattern = " \\| \\| ") %>% 
                  gsub(rep = "\n", pattern = " ") %>% 
                  fread()

Einzelteil_T02 <- read_file(Files_txt[2]) %>% 
                  gsub(rep = ";", pattern = "  ") %>%
                  gsub(rep = "\n", pattern = "\\t") %>%
                  fread()

Einzelteil_T03 <- read_file(Files_txt[3]) %>%
                  gsub(rep = ";", pattern = "\\|") %>%
                  gsub(rep = "\n", pattern = "") %>%
                  fread()

Einzelteil_T07 <- read_file(Files_txt[4]) %>%
                  gsub(rep = "\"\n\"", pattern = "\"\"") %>%
                  gsub(rep = "\n", pattern = "") %>%
                  fread()

Einzelteil_T09 <- read_file(Files_txt[5]) %>%
                  gsub(rep = "\n", pattern = "") %>%
                  gsub(rep = ";", pattern = "\\\\") %>%
                  fread()

Einzelteil_T11 <- read_file(Files_txt[6]) %>%
                  gsub(rep = ";", pattern = "\\t") %>%
                  gsub(rep = "\n", pattern = "") %>%
                  fread()

Einzelteil_T16 <- read_file(Files_txt[7]) %>%
                  gsub(rep = ";", pattern = " \\| \\| ") %>%
                  gsub(rep = "\n", pattern = "\\t") %>%
                  fread()

Einzelteil_T20 <- read_file(Files_txt[8]) %>%
                  gsub(rep = ";", pattern = " \\| \\| ") %>%
                  gsub(rep = "\n", pattern = " ") %>%
                  fread()

Einzelteil_T22 <- read_file(Files_txt[9]) %>%
                  gsub(replacement = ";", pattern = "\\t") %>% 
                  gsub(replacement = "\"\n\"", pattern = "\"\"") %>% 
                  gsub(rep = "\\1\n\"\\3", pat = "([A0-9])(\")([^\";])") %>% 
                  fread()

Einzelteil_T24 <- read_file(Files_txt[10]) %>%
                  gsub(rep = ";", pattern = "  ") %>%
                  gsub(rep = "\n", pattern = "") %>%
                  fread()

Einzelteil_T27 <- read_file(Files_txt[11]) %>%
                  gsub(rep = ";", pattern = " \\| \\| ") %>%
                  gsub(rep = "\n", pattern = "") %>%
                  fread()

Einzelteil_T31 <- read_file(Files_txt[12]) %>%
                  gsub(rep = ";", pattern = "  ") %>%
                  gsub(rep = "\n", pattern = "") %>%
                  fread()

Einzelteil_T34 <- read_file(Files_txt[13]) %>%
                  gsub(rep = ";", pattern = " \\| \\| ") %>%
                  gsub(rep = "\"\n\"", pattern = "\"\"") %>%
                  fread()

Einzelteil_T35 <- read_file(Files_txt[14]) %>%
                  gsub(replacement = ";", pattern = "\\\\") %>% 
                  gsub(replacement = "\"\n\"", pattern = "\"\"") %>% 
                  gsub(rep = "\\1\n\"\\3", pat = "([A0-9])(\")([^\";])") %>% 
                  fread()

Einzelteil_T36 <- read_file(Files_txt[15]) %>%
                  gsub(rep = ";", pattern = "  ") %>%
                  gsub(rep = "\n", pattern = " ") %>%
                  fread()

Einzelteil_T39 <- read_file(Files_txt[16]) %>%
                  gsub(rep = ";", pattern = "\\\\") %>%
                  gsub(rep = "\n", pattern = "") %>%
                  fread()

# Import all csv-Files from current path:
for (i in 1:length(Files_csv)) {
  Data_Name <- sub("*.csv", "", Files_csv[i])
  do.call("<-", list(Data_Name, read_delim(Files_csv[i], 
                                           show_col_types = FALSE)))
}

Imported_Files <- append(Imported_Files, c(Imported_Parts))

Data preparation

Observing imported data sets

A glimpse() to the imported data shows, that several data frames have unnecessary variables at the beginning. Like unnamed first column or in some cases a variable X1. Both are from type numeric and probably used as counting numbers. Since we have to restructure the complete frames regarding period and location, this information would lost his signification. All files for groups-tibbles have 2 unused vars on the beginning. Groups relations-tibbles have 1 unused var, except K2ST1. Examples for “Komponente_K1BE1”, “Bestandteile_Komponente_K1BE1” and the outstanding one “Bestandteile_Komponente_K2ST1”:

glimpse(Komponente_K1BE1)
>  Rows: 1,192,630
>  Columns: 10
>  $ ...1                             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
>  $ X1                               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
>  $ ID_Motor                         <chr> "K1BE1-101-1011-7", "K1BE1-101-1011-1…
>  $ Herstellernummer                 <dbl> 101, 101, 101, 101, 101, 101, 101, 10…
>  $ Werksnummer                      <dbl> 1011, 1011, 1011, 1011, 1011, 1011, 1…
>  $ Fehlerhaft                       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0…
>  $ Fehlerhaft_Datum                 <date> NA, NA, NA, NA, NA, NA, NA, 2010-04-…
>  $ Fehlerhaft_Fahrleistung          <dbl> 0, 0, 0, 0, 0, 0, 0, 31767, 0, 0, 317…
>  $ Produktionsdatum_Origin_01011970 <dbl> 14195, 14196, 14196, 14195, 14196, 14…
>  $ origin                           <chr> "01-01-1970", "01-01-1970", "01-01-19…
glimpse(Bestandteile_Komponente_K1BE1)
>  Rows: 1,192,630
>  Columns: 6
>  $ ...1     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
>  $ ID_T1    <chr> "1-201-2011-45", "1-201-2011-429", "1-201-2011-399", "1-201-2…
>  $ ID_T2    <chr> "2-201-2011-161", "2-201-2011-239", "2-201-2011-220", "2-202-
>  $ ID_T3    <chr> "3-202-2023-14", "3-202-2023-16", "3-202-2023-46", "3-202-202…
>  $ ID_T4    <chr> "4-202-2023-20", "4-202-2023-51", "4-202-2023-93", "4-204-204…
>  $ ID_K1BE1 <chr> "K1BE1-101-1011-1", "K1BE1-101-1011-2", "K1BE1-101-1011-3", "…
glimpse(Bestandteile_Komponente_K2ST1)
>  Rows: 1,908,208
>  Columns: 6
>  $ X1       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
>  $ X        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
>  $ ID_T11   <chr> "11-213-2131-128", "11-213-2131-134", "11-212-2121-107", "11-
>  $ ID_T12   <chr> "12-212-2121-15", "12-212-2121-28", "12-212-2121-73", "12-212…
>  $ ID_T13   <chr> "13-209-2091-25", "13-209-2092-7", "13-209-2092-69", "13-209-
>  $ ID_K2ST1 <chr> "K2ST1-109-1092-1", "K2ST1-109-1092-2", "K2ST1-109-1092-3", "…

The frames for relations of the vehicle data have the same issue in the first column and the imported parts respectively vehicle frames first both variables with counting values. Here view examples for vehicle relations, vehicle data and parts data.

str(Relations_Data_11)
>  spec_tbl_df [1,977,164 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
>   $ ...1         : num [1:1977164] 1 2 3 4 5 6 7 8 9 10 ...
>   $ ID_Karosserie: chr [1:1977164] "K4-112-1121-3" "K4-112-1121-4" "K4-112-1121-7" "K4-112-1121-9" ...
>   $ ID_Schaltung : chr [1:1977164] "K3SG1-105-1051-32" "K3SG1-105-1051-141" "K3SG1-105-1051-106" "K3SG1-105-1051-21" ...
>   $ ID_Sitze     : chr [1:1977164] "K2LE1-109-1091-2" "K2ST1-109-1092-5" "K2ST1-109-1092-57" "K2ST1-109-1092-91" ...
>   $ ID_Motor     : chr [1:1977164] "K1BE1-101-1011-7" "K1BE1-101-1011-12" "K1BE1-101-1011-38" "K1BE1-101-1011-97" ...
>   $ ID_Fahrzeug  : chr [1:1977164] "11-1-11-1" "11-1-11-2" "11-1-11-3" "11-1-11-4" ...
>   - attr(*, "spec")=
>    .. cols(
>    ..   ...1 = col_double(),
>    ..   ID_Karosserie = col_character(),
>    ..   ID_Schaltung = col_character(),
>    ..   ID_Sitze = col_character(),
>    ..   ID_Motor = col_character(),
>    ..   ID_Fahrzeug = col_character()
>    .. )
>   - attr(*, "problems")=<externalptr>
str(Vehicle_Data_11)
>  spec_tbl_df [1,977,164 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
>   $ ...1                   : num [1:1977164] 1 2 3 4 5 6 7 8 9 10 ...
>   $ X1                     : num [1:1977164] 1 2 3 4 5 6 7 8 9 10 ...
>   $ ID_Fahrzeug            : chr [1:1977164] "11-1-11-1" "11-1-11-2" "11-1-11-3" "11-1-11-4" ...
>   $ Produktionsdatum       : Date[1:1977164], format: "2008-11-18" "2008-11-18" ...
>   $ Herstellernummer       : num [1:1977164] 1 1 1 1 1 1 1 1 1 1 ...
>   $ Werksnummer            : num [1:1977164] 11 11 11 11 11 11 11 11 11 11 ...
>   $ Fehlerhaft             : num [1:1977164] 0 0 0 0 0 0 0 0 1 0 ...
>   $ Fehlerhaft_Datum       : Date[1:1977164], format: NA NA ...
>   $ Fehlerhaft_Fahrleistung: num [1:1977164] 0 0 0 0 0 ...
>   - attr(*, "spec")=
>    .. cols(
>    ..   ...1 = col_double(),
>    ..   X1 = col_double(),
>    ..   ID_Fahrzeug = col_character(),
>    ..   Produktionsdatum = col_date(format = ""),
>    ..   Herstellernummer = col_double(),
>    ..   Werksnummer = col_double(),
>    ..   Fehlerhaft = col_double(),
>    ..   Fehlerhaft_Datum = col_date(format = ""),
>    ..   Fehlerhaft_Fahrleistung = col_double()
>    .. )
>   - attr(*, "problems")=<externalptr>
str(Einzelteil_T01)
>  Classes 'data.table' and 'data.frame':   3204104 obs. of  23 variables:
>   $ V1                       : int  1 2 3 4 5 6 7 8 9 10 ...
>   $ X1                       : int  660 661 662 663 664 665 666 667 668 669 ...
>   $ ID_T01.x                 : chr  "1-201-2011-247" "1-201-2011-429" "1-201-2011-363" "1-201-2011-30" ...
>   $ Produktionsdatum.x       : IDate, format: "2008-11-07" "2008-11-07" ...
>   $ Herstellernummer.x       : int  201 201 201 201 201 201 201 201 201 201 ...
>   $ Werksnummer.x            : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
>   $ Fehlerhaft.x             : int  0 0 1 0 1 0 0 1 1 0 ...
>   $ Fehlerhaft_Datum.x       : IDate, format: NA NA ...
>   $ Fehlerhaft_Fahrleistung.x: int  0 0 12983 0 12983 0 0 12983 12935 0 ...
>   $ ID_T01.y                 : chr  NA NA NA NA ...
>   $ Produktionsdatum.y       : IDate, format: NA NA ...
>   $ Herstellernummer.y       : int  NA NA NA NA NA NA NA NA NA NA ...
>   $ Werksnummer.y            : int  NA NA NA NA NA NA NA NA NA NA ...
>   $ Fehlerhaft.y             : int  NA NA NA NA NA NA NA NA NA NA ...
>   $ Fehlerhaft_Datum.y       : IDate, format: NA NA ...
>   $ Fehlerhaft_Fahrleistung.y: num  NA NA NA NA NA NA NA NA NA NA ...
>   $ ID_T01                   : chr  NA NA NA NA ...
>   $ Produktionsdatum         : IDate, format: NA NA ...
>   $ Herstellernummer         : int  NA NA NA NA NA NA NA NA NA NA ...
>   $ Werksnummer              : int  NA NA NA NA NA NA NA NA NA NA ...
>   $ Fehlerhaft               : int  NA NA NA NA NA NA NA NA NA NA ...
>   $ Fehlerhaft_Datum         : IDate, format: NA NA ...
>   $ Fehlerhaft_Fahrleistung  : num  NA NA NA NA NA NA NA NA NA NA ...
>   - attr(*, ".internal.selfref")=<externalptr>

Bring data into consistent format

Delete first column used as counting number of rows.

Relations_Data_11 <- dplyr::select(Relations_Data_11, -1)
Relations_Data_12 <- dplyr::select(Relations_Data_12, -1)
Relations_Data_21 <- dplyr::select(Relations_Data_21, -1)
Relations_Data_22 <- dplyr::select(Relations_Data_22, -1)

Same reason: delete unecessary infirmation. For further automation in a loop we call “<-” as a function.

do.call("<-", list("Vehicle_Data_11", dplyr::select(get("Vehicle_Data_11"), -1, -2)))
do.call("<-", list("Vehicle_Data_12", dplyr::select(get("Vehicle_Data_12"), -1, -2)))
do.call("<-", list("Vehicle_Data_21", dplyr::select(get("Vehicle_Data_21"), -1, -2)))
do.call("<-", list("Vehicle_Data_22", dplyr::select(get("Vehicle_Data_22"), -1, -2)))

Using iterative method, going trough the list with imported parts-data frames, delete firs or first two columns with counting numbers.

# Delete counting Columns
for (i in 1:length(Imported_Parts)){
  do.call("<-", list(Imported_Parts[i], dplyr::select(get(Imported_Parts[i]), 
                                               -1, -2)))
}

# Same for components:
for (i in 1:length(Imported_Groups)){
  if (!is.null(grep("Bestand", Imported_Groups[i], invert = TRUE)) & 
      length(grep("Bestand", Imported_Groups[i], invert = TRUE)) == 0){
    do.call("<-", list(Imported_Groups[i], dplyr::select(get(Imported_Groups[i]), 
                                                  -1))) 
  } else {
    do.call("<-", list(Imported_Groups[i], dplyr::select(get(Imported_Groups[i]), 
                                                  -1, -2)))
  }
}

# Outstanding data frame as exception:
do.call("<-", list("Bestandteile_Komponente_K2ST1", 
                   dplyr::select(Bestandteile_Komponente_K2ST1, -1)))

Registrations_Data <- dplyr::select(Registrations_Data, -1)

Some of the tibbles have different rows containing one type of information repeated once or twice. This variables have “.x” or “.y” as additional suffix respectively.

glimpse(Einzelteil_T01)
>  Rows: 3,204,104
>  Columns: 21
>  $ ID_T01.x                  <chr> "1-201-2011-247", "1-201-2011-429", "1-201-2…
>  $ Produktionsdatum.x        <date> 2008-11-07, 2008-11-07, 2008-11-07, 2008-11…
>  $ Herstellernummer.x        <int> 201, 201, 201, 201, 201, 201, 201, 201, 201,…
>  $ Werksnummer.x             <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20…
>  $ Fehlerhaft.x              <int> 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,…
>  $ Fehlerhaft_Datum.x        <date> NA, NA, 2009-09-30, NA, 2009-09-30, NA, NA,…
>  $ Fehlerhaft_Fahrleistung.x <int> 0, 0, 12983, 0, 12983, 0, 0, 12983, 12935, 0…
>  $ ID_T01.y                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ Produktionsdatum.y        <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
>  $ Herstellernummer.y        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ Werksnummer.y             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ Fehlerhaft.y              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ Fehlerhaft_Datum.y        <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
>  $ Fehlerhaft_Fahrleistung.y <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ ID_T01                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ Produktionsdatum          <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
>  $ Herstellernummer          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ Werksnummer               <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ Fehlerhaft                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
>  $ Fehlerhaft_Datum          <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
>  $ Fehlerhaft_Fahrleistung   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

To fix this diversities we had to split the tibbles into two or three parts, rename the repeated columns and bind this parts below each other with rbind(). Some Tables had the double some tripple number of columns. We observe each first and than handle this cases separately.

# observe data with 14 Variables:
Temp_01           <- dplyr::select(Einzelteil_T01, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T01, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T01    <- dplyr::select(Einzelteil_T01, -c(1:14))
Einzelteil_T01    <- rbind(Einzelteil_T01, rbind(Temp_01, Temp_02))

Temp_01           <- dplyr::select(Einzelteil_T02, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T02, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T02    <- rbind(Temp_01, Temp_02)

Temp_01           <- dplyr::select(Einzelteil_T05, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T05, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T05    <- rbind(Temp_01, Temp_02)

Temp_01           <- dplyr::select(Einzelteil_T09, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T09, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T09    <- rbind(Temp_01, Temp_02)

Temp_01           <- dplyr::select(Einzelteil_T15, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T15, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T15    <- rbind(Temp_01, Temp_02)

Temp_01           <- dplyr::select(Einzelteil_T23, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T23, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T23    <- rbind(Temp_01, Temp_02)

Temp_01           <- dplyr::select(Einzelteil_T32, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T32, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T32    <- rbind(Temp_01, Temp_02)

Temp_01           <- dplyr::select(Einzelteil_T35, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T35, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T35    <- rbind(Temp_01, Temp_02)

Temp_01           <- dplyr::select(Einzelteil_T38, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T38, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T38    <- rbind(Temp_01, Temp_02)

Temp_01           <- dplyr::select(Einzelteil_T39, 1:7)
Temp_02           <- dplyr::select(Einzelteil_T39, 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
Einzelteil_T39    <- rbind(Temp_01, Temp_02)

Data_Name         <- "Einzelteil_T12"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))

Data_Name         <- "Einzelteil_T16"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))

Data_Name         <- "Einzelteil_T17"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))

Data_Name         <- "Einzelteil_T22"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))

Data_Name         <- "Einzelteil_T24"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))

Data_Name         <- "Einzelteil_T30"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))


# Same procedure for the Groups data:
Data_Name         <- "Komponente_K1DI1"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))

Data_Name         <- "Komponente_K3AG1"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))


# Prepair of 14 rows
Data_Name         <- "Komponente_K2LE1"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, dplyr::select(get(Data_Name), -c(1:14))))
do.call("<-", list(Data_Name, rbind(get(Data_Name), rbind(Temp_01, Temp_02))))

Data_Name         <- "Komponente_K3SG1"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, rbind(Temp_01, Temp_02)))

Data_Name         <- "Komponente_K4"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, rbind(Temp_01, Temp_02)))

Data_Name         <- "Komponente_K5"
Temp_01           <- dplyr::select(get(Data_Name), 1:7)
Temp_02           <- dplyr::select(get(Data_Name), 8:14)
colnames(Temp_01) <- gsub(x = colnames(Temp_01), pattern = "*\\.x", rep = "")
colnames(Temp_02) <- gsub(x = colnames(Temp_02), pattern = "*\\.y", rep = "")
do.call("<-", list(Data_Name, rbind(Temp_01, Temp_02)))

There was Datasets have variables called “Produktionsdatum_origin_01011970” and “Origin”.

head(Einzelteil_T03)
>             ID_T03 Herstellernummer Werksnummer Fehlerhaft Fehlerhaft_Datum
>  1: 3-202-2023-249              202        2023          0             <NA>
>  2:   3-202-2022-8              202        2022          0             <NA>
>  3: 3-202-2023-192              202        2023          0             <NA>
>  4:  3-202-2023-16              202        2023          0             <NA>
>  5: 3-202-2023-258              202        2023          0             <NA>
>  6:   3-202-2022-6              202        2022          0             <NA>
>     Fehlerhaft_Fahrleistung Produktionsdatum_Origin_01011970     origin
>  1:                       0                            14190 01-01-1970
>  2:                       0                            14190 01-01-1970
>  3:                       0                            14190 01-01-1970
>  4:                       0                            14190 01-01-1970
>  5:                       0                            14190 01-01-1970
>  6:                       0                            14190 01-01-1970

The first one is the value of the days since 01-01-1970 when the vehicle, part or group was produced. To find out the date of production of the unit, we add first one as value to the origin date. In the same step we delete the described columns and change the order of the variables for a normalized appearance.

# List with affected parts:
cDate_Parts     <- c("Einzelteil_T03", "Einzelteil_T04", "Einzelteil_T06", 
                     "Einzelteil_T07", "Einzelteil_T08", "Einzelteil_T10",
                     "Einzelteil_T11", "Einzelteil_T13", "Einzelteil_T14", 
                     "Einzelteil_T18", "Einzelteil_T19", "Einzelteil_T20",
                     "Einzelteil_T21", "Einzelteil_T25", "Einzelteil_T26", 
                     "Einzelteil_T27", "Einzelteil_T31", "Einzelteil_T33",
                     "Einzelteil_T34", "Einzelteil_T36", "Einzelteil_T37", 
                     "Einzelteil_T40")

# List with affected components:
cDate_Groups    <- c("Komponente_K1BE1", "Komponente_K1BE2", "Komponente_K1DI2",
                     "Komponente_K2LE2", "Komponente_K2ST2", "Komponente_K3SG2", 
                     "Komponente_K6", "Komponente_K7", "Komponente_K3AG2")

Vehicle_Data_21 <- mutate(Vehicle_Data_21, 
                          Produktionsdatum = Produktionsdatum_Origin_01011970 + 
                            as.Date(origin, format = "%d-%m-%Y")) %>% 
                          dplyr::select(-7, -8)
Vehicle_Data_21 <- Vehicle_Data_21[,c(1,7,2,3,4,5,6)]

Vehicle_Data_22 <- mutate(Vehicle_Data_22, 
                          Produktionsdatum = Produktionsdatum_Origin_01011970 + 
                            as.Date(origin, format = "%d-%m-%Y")) %>% 
                          dplyr::select(-7, -8)
Vehicle_Data_22 <- Vehicle_Data_22[,c(1,7,2,3,4,5,6)]

for (i in 1:length(cDate_Parts)){
  do.call("<-", list(cDate_Parts[i], 
                     mutate(get(cDate_Parts[i]), 
                          Produktionsdatum = Produktionsdatum_Origin_01011970 + 
                            as.Date(origin, format = "%d-%m-%Y")) %>% 
                          dplyr::select(-7, -8)))
}

# Change the order of several Datasets:
for (i in 1:length(cDate_Parts)){
  Data_Name     <- get(cDate_Parts[i])
  do.call("<-", list(cDate_Parts[i], get(cDate_Parts[i])[,c(1,7,2,3,4,5,6)]))
}

for (i in 1:length(cDate_Groups)){
  do.call("<-", list(cDate_Groups[i], 
                     mutate(get(cDate_Groups[i]), 
                          Produktionsdatum = Produktionsdatum_Origin_01011970 + 
                            as.Date(origin, format = "%d-%m-%Y")) %>% 
                          dplyr::select(-7, -8)))
}
for (i in 1:length(cDate_Groups)){
  Data_Name     <- get(cDate_Groups[i])
  do.call("<-", list(cDate_Groups[i], get(cDate_Groups[i])[,c(1,7,2,3,4,5,6)]))
}

To make sure all data sets was prepared correctly since yet, we observe all with the str() function. We do this separately for each level of the supply chain, to have a better overview.

Filter out the unrelated cities

The data sets with vehicles data, provides also vehicle ID´s from locations we should not analyze. So we use the list with registration numbers from cities (IDNumbers_Vector), our repair shops work in. We want start combining tables into one by the relations data sets between vehicles and parts. By using gather() we make only one variable for groups contains in each vehicle. And than filter out obsolete registrations.

Relations_Simple_11 <- Relations_Data_11 %>% 
  gather("Typ_Komponente", "ID_Komponente", -ID_Fahrzeug) %>% 
  filter_by_ID(IDNumbers_Vector)

Relations_Simple_12 <- Relations_Data_12 %>% 
  gather("Typ_Komponente", "ID_Komponente", -ID_Fahrzeug) %>% 
  filter_by_ID(IDNumbers_Vector)

Relations_Simple_21 <- Relations_Data_21 %>% 
  gather("Typ_Komponente", "ID_Komponente", -ID_Fahrzeug) %>% 
  filter_by_ID(IDNumbers_Vector)

Relations_Simple_22 <- Relations_Data_22 %>% 
  gather("Typ_Komponente", "ID_Komponente", -ID_Fahrzeug) %>% 
  filter_by_ID(IDNumbers_Vector)

Create one table

In following steps we start creating one table from tables we imported and prepared before.

# all registered vehicles outcasted in the right period binded by rows: 
Vehicle_Data_all <- tibble()
Vehicle_Data_all <- Vehicle_Data_all %>% 
      rbind(Vehicle_Data_11, Vehicle_Data_12, Vehicle_Data_21, Vehicle_Data_22) 

# Relations between all vehicles and components gathered into two vars:
Relations_Simple <- tibble()
# iterative execution of following rows conducts corruption of the names and tibble() does not reset the names
# Relations_Simple <- rename(Relations_Simple, colnames(Relations_Simple) = colnames(Relations_Simple_11))
Relations_Simple <- Relations_Simple %>% 
      rbind(Relations_Simple_11, Relations_Simple_12, 
            Relations_Simple_21, Relations_Simple_22) %>% 
      left_join(Vehicle_Data_all, by = c("ID_Fahrzeug"))

Rename first column of components data and bind all components sets one below other. Afterward change the column order for better reading.

# all components from the right period bind by rows:
Komponents_List <- sort(unique(paste0("Komponente_", sub("*-.*", "", Relations_Simple$ID_Komponente))))
Komponents_Data_all <- get(Komponents_List[1])

head(Komponents_Data_all)
>  # A tibble: 6 × 7
>    ID_Motor          Produktionsdatum Herste…¹ Werks…² Fehle…³ Fehlerha…⁴ Fehle…⁵
>    <chr>             <date>              <dbl>   <dbl>   <dbl> <date>       <dbl>
>  1 K1BE1-101-1011-7  2008-11-12            101    1011       0 NA               0
>  2 K1BE1-101-1011-12 2008-11-13            101    1011       0 NA               0
>  3 K1BE1-101-1011-90 2008-11-13            101    1011       0 NA               0
>  4 K1BE1-101-1011-2  2008-11-12            101    1011       0 NA               0
>  5 K1BE1-101-1011-8  2008-11-13            101    1011       0 NA               0
>  6 K1BE1-101-1011-11 2008-11-13            101    1011       0 NA               0
>  # … with abbreviated variable names ¹​Herstellernummer, ²​Werksnummer,
>  #   ³​Fehlerhaft, ⁴​Fehlerhaft_Datum, ⁵​Fehlerhaft_Fahrleistung
Komponents_Data_all <- dplyr::rename(Komponents_Data_all, 
                              "ID_Komponente" = colnames(Komponents_Data_all[1]))

# Somehow rename() changes all names except of 1 - the component K1DI2:
for (i in 2:length(Komponents_List)){
  Komponente <- tibble(get(Komponents_List[i]))
  Komponente <- dplyr::rename(Komponente, "ID_Komponente" = colnames(Komponente[1]))
  Komponents_Data_all <- rbind(Komponents_Data_all, Komponente)
}

do.call("<-", list("Relations_Simple", get("Relations_Simple")[,c(1,4,5,6,7,8,9,2,3)]))
Relations_Simple <- Relations_Simple %>% 
                    left_join(Komponents_Data_all, by = c("ID_Komponente"), 
                              suffix = c("_Fahrzeug", "_Komponente"))

Doe the same for relations between components and single parts. Gather() to simplify the relations set and rename the first column to bind and join it to the single data set.

Groups_Simple_K1BE1 <- Bestandteile_Komponente_K1BE1 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K1BE1) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K1BE1")
Groups_Simple_K1BE2 <- Bestandteile_Komponente_K1BE2 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K1BE2) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K1BE2")
Groups_Simple_K1DI1 <- Bestandteile_Komponente_K1DI1 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K1DI1) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K1DI1")
Groups_Simple_K1DI2 <- Bestandteile_Komponente_K1DI2 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K1DI2) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K1DI2")
Groups_Simple_K2LE1 <- Bestandteile_Komponente_K2LE1 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K2LE1) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K2LE1")
Groups_Simple_K2LE2 <- Bestandteile_Komponente_K2LE2 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K2LE2) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K2LE2")
Groups_Simple_K2ST1 <- Bestandteile_Komponente_K2ST1 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K2ST1) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K2ST1")
Groups_Simple_K2ST2 <- Bestandteile_Komponente_K2ST2 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K2ST2) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K2ST2")
Groups_Simple_K3AG1 <- Bestandteile_Komponente_K3AG1 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K3AG1) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K3AG1")
Groups_Simple_K3AG2 <- Bestandteile_Komponente_K3AG2 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K3AG2) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K3AG2")
Groups_Simple_K3SG1 <- Bestandteile_Komponente_K3SG1 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K3SG1) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K3SG1")
Groups_Simple_K3SG2 <- Bestandteile_Komponente_K3SG2 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K3SG2) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K3SG2")
Groups_Simple_K4 <- Bestandteile_Komponente_K4 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K4) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K4")
Groups_Simple_K5 <- Bestandteile_Komponente_K5 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K5) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K5")
Groups_Simple_K6 <- Bestandteile_Komponente_K6 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K6) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K6")
Groups_Simple_K7 <- Bestandteile_Komponente_K7 %>% 
                    gather("Typ_Einzelteil", "ID_Einzelteil", -ID_K7) %>% 
                    dplyr::rename("ID_Komponente" = "ID_K7")

Groups_Simple_all <- rbind(Groups_Simple_K7, Groups_Simple_K6, Groups_Simple_K5, 
                           Groups_Simple_K4, Groups_Simple_K3SG2, 
                           Groups_Simple_K3SG1, Groups_Simple_K3AG1, 
                           Groups_Simple_K3AG2, Groups_Simple_K2ST2, 
                           Groups_Simple_K2ST1, Groups_Simple_K2LE2, 
                           Groups_Simple_K2LE1, Groups_Simple_K1DI2, 
                           Groups_Simple_K1DI1, Groups_Simple_K1BE2, 
                           Groups_Simple_K1BE1)

Data_all <- left_join(Relations_Simple, Groups_Simple_all, 
                      by = c("ID_Komponente"), suffix = c("", "_Komponente"))

Create final dataset

Create one data set from the parts data to leftjoin it to the single set.

for (i in 1:length(Imported_Parts)){
  do.call("<-", list(Imported_Parts[i],
                     dplyr::rename(get(Imported_Parts[i]), 
                                "ID_Einzelteil" = colnames(get(Imported_Parts[i])[,1]))))
}

for (i in 1:length(Imported_Parts)){
  do.call("<-", list(Imported_Parts[i], as_tibble(get(Imported_Parts[i]))))
}
  
Parts_all <- rbind(Einzelteil_T01, Einzelteil_T02, Einzelteil_T03, 
                   Einzelteil_T04, Einzelteil_T05, Einzelteil_T06,
                   Einzelteil_T07, Einzelteil_T08, Einzelteil_T09, 
                   Einzelteil_T10, Einzelteil_T11, Einzelteil_T12, 
                   Einzelteil_T13, Einzelteil_T14, Einzelteil_T15,
                   Einzelteil_T16, Einzelteil_T17, Einzelteil_T18, 
                   Einzelteil_T20, Einzelteil_T21, Einzelteil_T22, 
                   Einzelteil_T23, Einzelteil_T24, Einzelteil_T25, 
                   Einzelteil_T26, Einzelteil_T27, Einzelteil_T30, 
                   Einzelteil_T31, Einzelteil_T32, Einzelteil_T19,
                   Einzelteil_T33, Einzelteil_T34, Einzelteil_T35, 
                   Einzelteil_T36, Einzelteil_T37, Einzelteil_T38, 
                   Einzelteil_T39, Einzelteil_T40)

Data_all_1 <- left_join(Data_all, Parts_all, by = c("ID_Einzelteil"), 
                        suffix = c("", "_Einzelteil"))

Finaly add the geographic data to the registrations and join it with the master-data set. As last step rearrange the collumns for better looking.

Registrations_Data <- dplyr::rename(Registrations_Data, "Gemeinde" = "Gemeinden")

Registrations_Geo <- full_join(Registrations_Data, Geodata, by = "Gemeinde")
Registrations_Geo <- dplyr::rename(Registrations_Geo, "ID_Fahrzeug"= "IDNummer")
Data_all_2 <- left_join(Data_all_1, Registrations_Geo, by = "ID_Fahrzeug")
Data_all_2 <- Data_all_2[,c(24, 25, 26, 27, 28, 1, 6, 
                            8, 9, 14, 16, 17, 21, 22)]

The vehicle is always considered defective if an installed individual part, an installed component or the entire vehicle is marked as defective. As next we create a new column and fill it with values, whenever a individual part, a component or a entire vehicle was defective. For this reason we add a column, which shows us the Vehicle was repaired in one of our shops.

Data_all_3 <- Data_all_2 %>% 
              mutate(Fehlerhaft_Datum_ = as.Date(Fehlerhaft_Datum %>% 
                    is.na() %>% 
                    if_else(is.na(Fehlerhaft_Datum_Fahrzeug) %>% 
                            if_else(as.Date(Fehlerhaft_Datum_Komponente), 
                                    as.Date(Fehlerhaft_Datum_Fahrzeug)),  
                            as.Date(Fehlerhaft_Datum))))

As we can see in the last column below, we still have unnecessary information. As example first rows shows, that there are vehicles never was defective, or vehicles from period we are not interested in.

head(Data_all_3, 20)
>  # A tibble: 20 × 15
>     Gemei…¹ Zulassung  Postl…² Laeng…³ Breit…⁴ ID_Fa…⁵ Fehlerha…⁶ Typ_K…⁷ ID_Ko…⁸
>     <chr>   <date>       <dbl> <chr>   <chr>   <chr>   <date>     <chr>   <chr>  
>   1 ASCHAF… 2009-01-02   63739 9,1478… 49,973… 11-1-1… NA         ID_Kar… K4-112…
>   2 ASCHAF… 2009-01-02   63739 9,1478… 49,973… 11-1-1… NA         ID_Kar… K4-112…
>   3 ASCHAF… 2009-01-02   63739 9,1478… 49,973… 11-1-1… NA         ID_Kar… K4-112…
>   4 ASCHAF… 2009-01-02   63739 9,1478… 49,973… 11-1-1… NA         ID_Kar… K4-112…
>   5 ASCHAF… 2009-01-02   63739 9,1478… 49,973… 11-1-1… NA         ID_Kar… K4-112…
>   6 ASCHAF… 2009-01-02   63739 9,1478… 49,973… 11-1-1… NA         ID_Kar… K4-112…
>   7 LANDSH… 2009-01-02   84028 12,158… 48,538… 11-1-1… NA         ID_Kar… K4-112…
>   8 LANDSH… 2009-01-02   84028 12,158… 48,538… 11-1-1… NA         ID_Kar… K4-112…
>   9 LANDSH… 2009-01-02   84028 12,158… 48,538… 11-1-1… NA         ID_Kar… K4-112…
>  10 LANDSH… 2009-01-02   84028 12,158… 48,538… 11-1-1… NA         ID_Kar… K4-112…
>  11 LANDSH… 2009-01-02   84028 12,158… 48,538… 11-1-1… NA         ID_Kar… K4-112…
>  12 LANDSH… 2009-01-02   84028 12,158… 48,538… 11-1-1… NA         ID_Kar… K4-112…
>  13 BAYREU… 2009-01-02   95444 11,571… 49,945… 11-1-1… NA         ID_Kar… K4-112…
>  14 BAYREU… 2009-01-02   95444 11,571… 49,945… 11-1-1… NA         ID_Kar… K4-112…
>  15 BAYREU… 2009-01-02   95444 11,571… 49,945… 11-1-1… NA         ID_Kar… K4-112…
>  16 BAYREU… 2009-01-02   95444 11,571… 49,945… 11-1-1… NA         ID_Kar… K4-112…
>  17 BAYREU… 2009-01-02   95444 11,571… 49,945… 11-1-1… NA         ID_Kar… K4-112…
>  18 BAYREU… 2009-01-02   95444 11,571… 49,945… 11-1-1… NA         ID_Kar… K4-112…
>  19 BAMBERG 2009-01-02   96047 10,886… 49,893… 11-1-1… NA         ID_Kar… K4-112…
>  20 BAMBERG 2009-01-02   96047 10,886… 49,893… 11-1-1… NA         ID_Kar… K4-112…
>  # … with 6 more variables: Fehlerhaft_Datum_Komponente <date>,
>  #   Typ_Einzelteil <chr>, ID_Einzelteil <chr>, Fehlerhaft <dbl>,
>  #   Fehlerhaft_Datum <date>, Fehlerhaft_Datum_ <date>, and abbreviated variable
>  #   names ¹​Gemeinde, ²​Postleitzahl, ³​Laengengrad, ⁴​Breitengrad, ⁵​ID_Fahrzeug,
>  #   ⁶​Fehlerhaft_Datum_Fahrzeug, ⁷​Typ_Komponente, ⁸​ID_Komponente
>  # ℹ Use `colnames()` to see all variable names

Filter out this vehicles by the time limits.

Data_all_3 <- subset(Data_all_3, 
                          (Data_all_3$Fehlerhaft_Datum_ <= "2016-12-31" &
                            Data_all_3$Fehlerhaft_Datum_ >= "2014-1-1"))

glimpse(Data_all_3)
>  Rows: 146,594
>  Columns: 15
>  $ Gemeinde                    <chr> "ASCHAFFENBURG", "ROSENHEIM", "ROSENHEIM",…
>  $ Zulassung                   <date> 2012-05-21, 2012-05-21, 2012-05-21, 2012-…
>  $ Postleitzahl                <dbl> 63739, 83022, 83022, 83022, 83022, 83022, …
>  $ Laengengrad                 <chr> "9,147832", "12,12482", "12,12482", "12,12…
>  $ Breitengrad                 <chr> "49,973506", "47,855828", "47,855828", "47…
>  $ ID_Fahrzeug                 <chr> "11-1-11-480692", "11-1-11-480744", "11-1-…
>  $ Fehlerhaft_Datum_Fahrzeug   <date> NA, NA, NA, NA, NA, NA, NA, 2013-06-14, N…
>  $ Typ_Komponente              <chr> "ID_Karosserie", "ID_Karosserie", "ID_Karo…
>  $ ID_Komponente               <chr> "K4-114-1141-9645", "K4-114-1141-10551", "…
>  $ Fehlerhaft_Datum_Komponente <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
>  $ Typ_Einzelteil              <chr> "ID_T32", "ID_T32", "ID_T32", "ID_T32", "I…
>  $ ID_Einzelteil               <chr> "32-216-2162-399461", "32-216-2162-399373"…
>  $ Fehlerhaft                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
>  $ Fehlerhaft_Datum            <date> 2014-01-03, 2014-01-03, 2014-01-03, 2014-…
>  $ Fehlerhaft_Datum_           <date> 2014-01-03, 2014-01-03, 2014-01-03, 2014-…

Write a file for submit.

write.csv2(Data_all_3, "Final_Dataset_Group_05.csv")

Data evaluation by means of Shiny application

Task 4.a. A bar chart stacking the number of failures for each part for each month. It should be possible to filter by location.

Data processing for bar chart

Before that we finished extracting and organizing the important data, and then we will visualize the data with the help of Shiny app. The first will use the geom_bar function to show the number of failures per part per city for the last three years, which can be used to allow the user to select the part and city themselves through the selection function.

Here, part ID_T25 is used as an example to analyze its data, first selecting the data to be used from the final csv data, such as part type, city, failure condition and time of failure. Since only the part in which the failure occurred is considered here, it is necessary to filter Fehlerhaft with the filter function and then use the filter function to select the data for part ID_T25. Then use the time function to convert the time when the failure occurred, followed by the group_by function to group the data so that the number of failures can be calculated.

Final_Daten <- read.csv2("Final_Dataset_Group_05.csv")
data <- Final_Daten %>%
  filter(Fehlerhaft>=1) %>%
  dplyr::select(c("Typ_Einzelteil","Gemeinde", "Fehlerhaft", "Fehlerhaft_Datum")) %>%
  filter(Typ_Einzelteil == "ID_T25") %>%
  mutate(Month = month(Fehlerhaft_Datum), Year = year(Fehlerhaft_Datum)) %>% 
  mutate(date = as.Date(format.Date(Fehlerhaft_Datum, "%Y-%m-1"), "%Y-%m-%d")) %>%
  group_by(Gemeinde, date, Typ_Einzelteil) %>%
  dplyr::summarise(Anzahl = n(), .groups = "keep")

head(data)
>  # A tibble: 6 × 4
>  # Groups:   Gemeinde, date, Typ_Einzelteil [6]
>    Gemeinde      date       Typ_Einzelteil Anzahl
>    <chr>         <date>     <chr>           <int>
>  1 ASCHAFFENBURG 2014-02-01 ID_T25              3
>  2 ASCHAFFENBURG 2014-03-01 ID_T25              3
>  3 ASCHAFFENBURG 2014-04-01 ID_T25              1
>  4 ASCHAFFENBURG 2014-05-01 ID_T25              3
>  5 ASCHAFFENBURG 2014-07-01 ID_T25              2
>  6 ASCHAFFENBURG 2014-09-01 ID_T25              3

Visualization

The data is finally visualized using the geom_bar function, where the horizontal coordinate is the time and the vertical coordinate is the number of faults, and the text size and scale of the axes are adjusted. In addition to using function ggplotly to make a graphic interactive. The pretty() function is used to determine the sequence of about equally spaced round values.

# Shown below is the number of part ID_T25 failures per month in ten cities between 2014 and 2016
p <- ggplot(data, aes(x = date, y = Anzahl, fill = Gemeinde, group = Gemeinde)) +
        geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values=c("red", "blue", "green", "yellow","goldenrod3",
                              "yellowgreen","rosybrown3", "magenta","cyan","cornsilk4")) +
        labs(x = "Month", y = "The number of failures", title = "The number of failures for each part for each month") +
  scale_x_date(
          breaks = "1 month",
          limits = as.Date(c('2014-01','2016-12'), format = "%Y-%b"),
          labels = date_format(format = "%Y-%b", tz = "ECT"),
        ) +
  scale_y_continuous(
          breaks = function(y) {
            pretty(y)
          }
        ) +
  theme(
          axis.text.x = element_text(angle = 45, hjust = 0, size = 4),
          axis.text.y = element_text(size = 10),
          axis.title = element_text(size = 10)
        )

ggplotly(p)

Task 4.b. The outage history and a forecast for the outage of a selectable part for the 1st quarter of 2017. Both the total outage, and the outage per city mentioned above, should be apparent from the visualization.

4.b.1. Data processing for the outage history

The data for each part in ten cities over the last three years are processed here, mainly to count the total number of failures per part per city.

total_outage_data <- Final_Daten %>%
  filter(Fehlerhaft >= 1) %>%
  dplyr::select(c("Typ_Einzelteil", "Gemeinde", "Fehlerhaft", "Fehlerhaft_Datum")) %>%
  mutate(Monat = as.Date(format.Date(Fehlerhaft_Datum, "%Y-%m-1"), "%Y-%m-%d")) %>%
  group_by(Typ_Einzelteil) %>%
  dplyr::summarise(Anzahl = n(), .groups = "keep")

print(total_outage_data)
>  # A tibble: 38 × 2
>  # Groups:   Typ_Einzelteil [38]
>     Typ_Einzelteil Anzahl
>     <chr>           <int>
>   1 ID_T1            8223
>   2 ID_T10            465
>   3 ID_T11           2885
>   4 ID_T12           2380
>   5 ID_T13           2254
>   6 ID_T14            622
>   7 ID_T15            596
>   8 ID_T16            894
>   9 ID_T17            755
>  10 ID_T18            742
>  # … with 28 more rows
>  # ℹ Use `print(n = ...)` to see more rows

Visualization

The data is finally visualized using the geom_bar function, where the horizontal coordinate is the Typ_Einzelteil and the vertical coordinate is the number of failures, and the text size and scale of the axes are adjusted. In addition to using function ggplotly to make a graphic interactive. The pretty() function is used to determine the sequence of about equally spaced round values.

p <- ggplot(total_outage_data, aes(x = Typ_Einzelteil, y = Anzahl, color = Typ_Einzelteil, fill = Typ_Einzelteil)) +
  geom_bar(stat = "identity") +
  labs(x = "Typ_Einzelteil", y = "The number of failures", title = "The total number of failures for each part") +
  scale_y_continuous(
    breaks = function(y) {
      unique(floor(pretty(y)))
    }
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 0, size = 4),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 10),
  )
ggplotly(p)

4.b.2. Forecast for the outage of a selectable part for the 1st quarter of 2017

An analysis of the number of failures per part per city for the past three years needs to be performed before forecasting deliveries per part per city for the first quarter of 2017. First, the number of failures per city, per quarter, per part for the past three years needs to be counted, and then the relationship between time and number of failures needs to be considered.

Data processed for Part ID_T25 delivery forecast in city ASCHAFFENBURG for the first quarter of 2017

Since quarter cannot be used as a continuous variable, quarter is used here as a dummy variable to consider the linear relationship between quarter and number of failures. Here the linear relationship between its time and the number of failures is analyzed with the column of part ID_T25 in ASCHAFFENBURG, and then its prediction is made.

 data_forecast_ID_T25 <- Final_Daten %>%
   filter(Fehlerhaft >= 1) %>%
   dplyr::select(c("Typ_Einzelteil", "Gemeinde", "Fehlerhaft", "Fehlerhaft_Datum")) %>%
   filter(Gemeinde == "ASCHAFFENBURG") %>%
   filter(Typ_Einzelteil == "ID_T25") %>%
   mutate(Month = month(Fehlerhaft_Datum), Year = year(Fehlerhaft_Datum)) %>%
   mutate(Monat = as.Date(format.Date(Fehlerhaft_Datum, "%Y-%m-1"), "%Y-%m-%d")) %>%
   mutate(Quarter = quarters(as.POSIXlt(Fehlerhaft_Datum))) %>%
   group_by(Year, Monat, Typ_Einzelteil, Gemeinde) %>%
   dplyr::summarise(Anzahl = n(), .groups = "keep")
 
 print(data_forecast_ID_T25)
>  # A tibble: 19 × 5
>  # Groups:   Year, Monat, Typ_Einzelteil, Gemeinde [19]
>      Year Monat      Typ_Einzelteil Gemeinde      Anzahl
>     <int> <date>     <chr>          <chr>          <int>
>   1  2014 2014-02-01 ID_T25         ASCHAFFENBURG      3
>   2  2014 2014-03-01 ID_T25         ASCHAFFENBURG      3
>   3  2014 2014-04-01 ID_T25         ASCHAFFENBURG      1
>   4  2014 2014-05-01 ID_T25         ASCHAFFENBURG      3
>   5  2014 2014-07-01 ID_T25         ASCHAFFENBURG      2
>   6  2014 2014-09-01 ID_T25         ASCHAFFENBURG      3
>   7  2014 2014-10-01 ID_T25         ASCHAFFENBURG      1
>   8  2014 2014-11-01 ID_T25         ASCHAFFENBURG      3
>   9  2014 2014-12-01 ID_T25         ASCHAFFENBURG      1
>  10  2015 2015-02-01 ID_T25         ASCHAFFENBURG      3
>  11  2015 2015-03-01 ID_T25         ASCHAFFENBURG      2
>  12  2015 2015-04-01 ID_T25         ASCHAFFENBURG      6
>  13  2015 2015-09-01 ID_T25         ASCHAFFENBURG      2
>  14  2015 2015-10-01 ID_T25         ASCHAFFENBURG      2
>  15  2016 2016-01-01 ID_T25         ASCHAFFENBURG      2
>  16  2016 2016-07-01 ID_T25         ASCHAFFENBURG      1
>  17  2016 2016-08-01 ID_T25         ASCHAFFENBURG      2
>  18  2016 2016-09-01 ID_T25         ASCHAFFENBURG      2
>  19  2016 2016-11-01 ID_T25         ASCHAFFENBURG      3

Since there are no failures in a particular quarter when counting the number of failures in each quarter, the pad_by_time function is used to interpolate the number of failures by 0, thus avoiding data mismatch at a later stage.

# 0 interpolation for missing months
g1 <- tibble(data_forecast_ID_T25) %>%
  pad_by_time(Monat, .by = "months", .start_date = "2014-01-01", .end_date = "2016-12-31", .pad_value = 0) %>%
  dplyr::select(c("Monat", "Anzahl"))

print(g1)
>  # A tibble: 36 × 2
>     Monat      Anzahl
>     <date>      <int>
>   1 2014-01-01      0
>   2 2014-02-01      3
>   3 2014-03-01      3
>   4 2014-04-01      1
>   5 2014-05-01      3
>   6 2014-06-01      0
>   7 2014-07-01      2
>   8 2014-08-01      0
>   9 2014-09-01      3
>  10 2014-10-01      1
>  # … with 26 more rows
>  # ℹ Use `print(n = ...)` to see more rows

The number of failures per quarter is then calculated based on the number of failures per month for part ID_T25 for the past three years. After that it is necessary to match the dummy variable with the number of failures per quarter.

# Calculate the number of failures per quarter for part ID_T25
g2 <- data.frame(Anzahl = c(sum(g1$Anzahl[1:3]), sum(g1$Anzahl[4:6]), sum(g1$Anzahl[7:9]), sum(g1$Anzahl[10:12]), sum(g1$Anzahl[13:15]), sum(g1$Anzahl[16:18]), sum(g1$Anzahl[19:21]), sum(g1$Anzahl[22:24]), sum(g1$Anzahl[25:27]), sum(g1$Anzahl[28:30]), sum(g1$Anzahl[31:33]), sum(g1$Anzahl[34:36])))

# Match the dummy variable with the number of failures per quarter
Quarter <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

model_data <- data.frame(Quarter, Anzahl = g2$Anzahl)

print(model_data)
>     Quarter Anzahl
>  1        1      6
>  2        2      4
>  3        3      5
>  4        4      5
>  5        5      5
>  6        6      6
>  7        7      2
>  8        8      2
>  9        9      2
>  10      10      0
>  11      11      5
>  12      12      3

Visualization and evaluation

Build a linear model of the number of failures per quarter versus the dummy variable (Quarter) using the lm function

A linear model of the number of failures per quarter versus a dummy variable (Quarter) is first created using the lm function, and then using the summary function, the coefficients of the linear relationship can be viewed.

# Build a linear model of the number of failures per quarter versus the dummy variable (Quarter) using the lm function
liner_model <- lm(formula = Anzahl ~ Quarter, model_data)

# Using the summary function you can view the coefficients
summary(liner_model)
>  
>  Call:
>  lm(formula = Anzahl ~ Quarter, data = model_data)
>  
>  Residuals:
>      Min      1Q  Median      3Q     Max 
>  -2.6853 -1.1626  0.3374  0.8260  2.6189 
>  
>  Coefficients:
>              Estimate Std. Error t value Pr(>|t|)    
>  (Intercept)   5.7273     1.0116   5.661 0.000209 ***
>  Quarter      -0.3042     0.1375  -2.213 0.051292 .  
>  ---
>  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>  
>  Residual standard error: 1.644 on 10 degrees of freedom
>  Multiple R-squared:  0.3288, Adjusted R-squared:  0.2616 
>  F-statistic: 4.898 on 1 and 10 DF,  p-value: 0.05129
# intercept of the linear function
intercept <- liner_model[[1]][[1]]

# slope of the linear function
slope <- liner_model[[1]][[2]]

Q1_2017 <- data.frame(Quarter = 13)

# Save the predicted values
# model_data$predicted <- predict(liner_model)
# # Save the residual values
# model_data$residuals <- residuals(liner_model) 

Visualization

Finally the linear relationship is visualized using the geom_bar function, with the horizontal coordinate being time and the vertical coordinate being the number of failures, and the predicted data for the first quarter of 2017 is added. In addition to this use the function ggplotly to make the graph interactive outside. The pretty() function is used to determine the sequence of about equally spaced round values.

p <- ggplot() +
  geom_point(aes(x = Quarter, y = Anzahl), data = model_data, color = "blue") +
  geom_abline(intercept = intercept, slope = slope, linetype = 2, data = model_data) +
  geom_point(aes(x = Q1_2017$Quarter, y = predict(liner_model, newdata = Q1_2017)), colour = "red", size = 3) +
  labs(x = "Quarter", y = "The number of failures", title = "The linear model of failures for part ID_T25") +
  scale_x_continuous(
    breaks = function(x) {
      pretty(x, n = 13)
    }
  )
ggplotly(p)

Regression diagnostics

Residuals vs Fitted.

Previously, we established a linear relationship between the number of part failures and time. Now, this linear relationship needs further validation. The residuals are the differences between the predicted and measured outputs of the model. Thus, the residuals represent the part of the data that is not explained by the model. It is used to check the assumptions of the linear relationship. As can be seen on the graph, the fluctuation of the residuals is limited to 1. Since the number of parts must be integer, it can be considered as a horizontal line indicating a linear relationship.

# par(mfrow = c(2, 2))
plot(liner_model, which = 1, col=c("blue"))
>  Error : The fig.showtext code chunk option must be TRUE

Normal Q-Q.

Now make a Q-Q plot, which is useful for determining whether the residuals follow a normal distribution. As you can see from the graph, the residuals are close to a straight line at an approximate angle of 45 degrees, which suggests that he is normally distributed.

res <- resid(liner_model)

#create Q-Q plot for residuals
qqnorm(res)
>  Error : The fig.showtext code chunk option must be TRUE
#add a straight diagonal line to the plot
qqline(res)

Scale-Location

Used to check the homogeneity of variance of the residuals (homoscedasticity). Horizontal line with equally spread points is a good indication of homoscedasticity.

plot(liner_model, which = 3, col=c("blue"))
>  Error : The fig.showtext code chunk option must be TRUE

Residuals vs Leverage

Used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis.

plot(liner_model, which = 5, col=c("blue"))
>  Error : The fig.showtext code chunk option must be TRUE

4.b.3. Forecast of delivery volume per part per city

Data processing for the number of failures of part ID_T25

Previously the forecast for part ID_T25 in the city of ASCHAFFENBURG was completed, and now the forecast for part ID_T25 in ten cities is counted. There is no need to filter the cities here. Since predicting ID_T25 parts for each of the ten cities is more repetitive work, a custom function is used here, mainly to loop through the cities. The data for the ten cities will then be collated together using the data.frame function.

forecast_data <- Final_Daten %>%
  dplyr::select(c("Typ_Einzelteil", "Gemeinde", "Fehlerhaft", "Fehlerhaft_Datum")) %>%
  filter(Typ_Einzelteil == "ID_T25") %>%
  filter(Fehlerhaft >= 1) %>%
  mutate(Month = month(Fehlerhaft_Datum), Year = year(Fehlerhaft_Datum)) %>%
  mutate(Quarter = quarters(as.POSIXlt(Fehlerhaft_Datum))) %>%
  group_by(Year, Quarter, Gemeinde, Typ_Einzelteil) %>%
  dplyr::summarise(Anzahl = n(), .groups = "keep")

d1 <- spread(forecast_data, Gemeinde, Anzahl)

d2 <- data.frame(d1)

d1[is.na(d1)] <- 0

Quarter <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

f <- function(i) {
  Quarter <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

  data <- data.frame(Quarter, d1[, i])

  # fiting the linear model
  liner_model <- lm(formula = as.formula(paste(colnames(data)[2], "~ .")), data)

  Q1_2017 <- data.frame(Quarter = 13)

  # predicts the future values
  q1 <- ceiling(predict(liner_model, newdata = Q1_2017))
  return(q1)
}

predict_Q1 <- data.frame(Prediction = c(f(4),f(5),f(6),f(7),f(8),f(9),f(10),f(11),f(12),f(13)), Gemeinde = c("ASCHAFFENBURG","AUGSBURG","BAMBERG","BAYREUTH","ERLANGEN","INGOLSTADT","LANDSHUT","REGENSBURG","ROSENHEIM","WUERZBURG"))

head(predict_Q1)
>    Prediction      Gemeinde
>  1          2 ASCHAFFENBURG
>  2         11      AUGSBURG
>  3          4       BAMBERG
>  4          5      BAYREUTH
>  5          2      ERLANGEN
>  6          3    INGOLSTADT

Visualization

Finally, ID_T25’s forecasted numbers for Q1 2017 in ten cities are visualized using the geom_bar function, with the horizontal coordinates being the cities and the vertical coordinates being the forecasted numbers. In addition to this use the function ggplotly to make the graph interactive outside. The pretty() function is used to determine the sequence of about equally spaced round values.

p <- ggplot(predict_Q1, aes(x = Gemeinde, y = Prediction, color = Gemeinde, fill = Gemeinde)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(x = "Gemeinde", y = "The number of prediction", title = "The Forecast number of part ID_T25 per city") +
  scale_y_continuous(
    breaks = function(y) {
      pretty(y)
    }
  ) +
  theme(
    axis.text.x = element_text(size = 5),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 10)
  )

ggplotly(p)

4.b.4 Forecast of total deliveries per part in ten cities

Data processed as deliverables per part for the first quarter of 2017

The data analysis and forecasting for part ID_T25 was completed prior to this. Now counting the total deliveries of 38 parts in 10 cities. Since the data is complex, two functions are used here to loop through the parts and cities, and then all the predicted quantities are counted in a table.

forecast_data_total <- Final_Daten %>%
  dplyr::select(c("Typ_Einzelteil", "Gemeinde", "Fehlerhaft", "Fehlerhaft_Datum")) %>%
  filter(Fehlerhaft >= 1) %>%
  mutate(Month = month(Fehlerhaft_Datum), Year = year(Fehlerhaft_Datum)) %>%
  mutate(Quarter = quarters(as.POSIXlt(Fehlerhaft_Datum))) %>%
  group_by(Typ_Einzelteil, Gemeinde, Year, Quarter) %>%
  dplyr::summarise(Anzahl = n(), .groups = "keep") %>%
  spread(Gemeinde, Anzahl, fill = 0)

total_1 <- forecast_data_total[(1 * 12 - 11):(1 * 12), ]

# Custom functions are used to calculate a linear relationship between the number of failures and time for each part over the past three years.

f <- function(i, j) {
  Quarter <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

  forecast_data_total_1 <- forecast_data_total[(j * 12 - 11):(j * 12), ]

  data <- data.frame(Quarter, forecast_data_total_1[, i])

  # fiting the linear model
  liner_model <- lm(formula = as.formula(paste(colnames(data)[2], "~ .")), data)

  Q1_2017 <- data.frame(Quarter = 13)
  # predicts the future values

  q1 <- ceiling(predict(liner_model, newdata = Q1_2017))
  return(q1)
}

# Custom function to count the number of predictions per city per part
f.total <- function(i) {
  predict_Q1 <- data.frame(Gemeinde = c("ASCHAFFENBURG", "AUGSBURG", "BAMBERG", "BAYREUTH", "ERLANGEN", "INGOLSTADT", "LANDSHUT", "REGENSBURG", "ROSENHEIM", "WUERZBURG"), Prediction = c(f(4, i), f(5, i), f(6, i), f(7, i), f(8, i), f(9, i), f(10, i), f(11, i), f(12, i), f(13, i)))

  predict_t <- data.frame(Prediction <- sum(predict_Q1$Prediction))
  return(predict_t)
}


Prediction <- list(
  f.total(1), f.total(2), f.total(3), f.total(4), f.total(5), f.total(6), f.total(7), f.total(8), f.total(9), f.total(10), f.total(11), f.total(12), f.total(13), f.total(14), f.total(15), f.total(16), f.total(17), f.total(18),
  f.total(19), f.total(20), f.total(21), f.total(22), f.total(23), f.total(24), f.total(25), f.total(26), f.total(27), f.total(28), f.total(29), f.total(30), f.total(31), f.total(32), f.total(33), f.total(34), f.total(35), f.total(36), f.total(37), f.total(38)
)

data_p <- rbindlist(Prediction)
colnames(data_p) <- c("Prediction")

# Merge the data of the parts
predict_total <- data.frame(
  Typ_Einzelteil = c("ID_T01", "ID_T10", "ID_T11", "ID_T12", "ID_T13", "ID_T14", "ID_T15", "ID_T16", "ID_T17", "ID_T18", "ID_T19", "ID_T02", "ID_T20", "ID_T21", "ID_T22", "ID_T23", "ID_T24", "ID_T25", "ID_T26", "ID_T27", "ID_T03", "ID_T30", "ID_T31", "ID_T32", "ID_T33", "ID_T34", "ID_T35", "ID_T36", "ID_T37", "ID_T38", "ID_T39", "ID_T04", "ID_T40", "ID_T05", "ID_T06", "ID_T07", "ID_T08", "ID_T09"),
  Prediction = data_p
)

print(predict_total)
>     Typ_Einzelteil Prediction
>  1          ID_T01        651
>  2          ID_T10         41
>  3          ID_T11        249
>  4          ID_T12        197
>  5          ID_T13        219
>  6          ID_T14         56
>  7          ID_T15         57
>  8          ID_T16         78
>  9          ID_T17         68
>  10         ID_T18         54
>  11         ID_T19         20
>  12         ID_T02        294
>  13         ID_T20         20
>  14         ID_T21        325
>  15         ID_T22        280
>  16         ID_T23        203
>  17         ID_T24         60
>  18         ID_T25         43
>  19         ID_T26         64
>  20         ID_T27         14
>  21         ID_T03        129
>  22         ID_T30        263
>  23         ID_T31        253
>  24         ID_T32        211
>  25         ID_T33         34
>  26         ID_T34         82
>  27         ID_T35         78
>  28         ID_T36         46
>  29         ID_T37         60
>  30         ID_T38         32
>  31         ID_T39         32
>  32         ID_T04        112
>  33         ID_T40         32
>  34         ID_T05        103
>  35         ID_T06        132
>  36         ID_T07         40
>  37         ID_T08         50
>  38         ID_T09         37

Visualization

Finally, total predicted quantity per part for Q1 2017 in ten cities are visualized using the geom_bar function, with the horizontal coordinates being the part and the vertical coordinates being the forecasted numbers. In addition to this use the function ggplotly to make the graph interactive outside. The pretty() function is used to determine the sequence of about equally spaced round values.

g <- predict_total %>% ggplot(aes(x = Typ_Einzelteil, y = Prediction, fill = Typ_Einzelteil, color = Typ_Einzelteil)) +
  geom_bar(stat = "identity") +
  labs(x = "Typ_Einzelteil", y = "Total predicted quantity per part", title = "Total forecast quantities per part for Q1 2017") +
  scale_y_continuous(
    breaks = function(y) {
      unique(floor(pretty(y)))
    }
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 0, size = 5),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 10)
  )

ggplotly(g)

4.d Your underlying data set as a table, so that you can prove visualized data. Again, remember to show only the necessary attributes.

The relevant data required in this case study is shown here, where Typ_Einzelteil indicates the type of part, Fehlerhaft indicates whether the part is damaged, Fehlerhaft_Datum indicates when the part failed, Gemeinde indicates the city where the part failed, and Postleitzahl, Laengengrad, and Breitengrad are used to indicate the exact location of the city.

table <- Final_Daten[, c(
      "Typ_Einzelteil",
      "Fehlerhaft",
      "Fehlerhaft_Datum",
      "Gemeinde",
      "Postleitzahl",
      "Laengengrad",
      "Breitengrad"
    )]

head(table)
>    Typ_Einzelteil Fehlerhaft Fehlerhaft_Datum      Gemeinde Postleitzahl
>  1         ID_T32          1       2014-01-03 ASCHAFFENBURG        63739
>  2         ID_T32          1       2014-01-03     ROSENHEIM        83022
>  3         ID_T32          1       2014-01-03     ROSENHEIM        83022
>  4         ID_T32          1       2014-01-03     ROSENHEIM        83022
>  5         ID_T32          1       2014-01-03     ROSENHEIM        83022
>  6         ID_T32          1       2014-01-03     ROSENHEIM        83022
>    Laengengrad Breitengrad
>  1    9.147832    49.97351
>  2   12.124820    47.85583
>  3   12.124820    47.85583
>  4   12.124820    47.85583
>  5   12.124820    47.85583
>  6   12.124820    47.85583

Result for Shiny App

Designing a Shiny App consists of two main parts, UI and Server, UI is used to design the app’s interface layout, text, colors and other theme styles, Server is used to process data, analyze data, and transfer data. This App has the following main features:

  • The theme color of the application is light blue, and the font is “Source Sans Pro”, the logo is the department for quality science (Fachgebiet für Qualitätswissenschaften)

  • A bar chart stacking the number of failures for each part for each month. It should be possible to filter by location

  • The outage history and a forecast for the outage of a selectable part for the 1st quarter of 2017. Both the total outage, and the outage per city mentioned above, should be apparent from the visualization.

  • An interactive map highlighting all locations and integrate pop-ups showing your recommended numbers of units for the part selected from b. above.

  • A table showing the data needed for the visualization

A: The number of failures for each part for each month

Here is the first screen of the App, stacking the number of failures per part per month in a bar graph, which can be filtered by location.

The number of failures for each part for each month

The number of failures for each part for each month

B.1: The total number of failures for each part in ten cities

The total number of failures for each part in ten cities over the past three years is shown here

The total number of failures for each part in ten cities

The total number of failures for each part in ten cities

B.2: The total forecast quantities for each part for Q1 2017

The total forecasted number of deliveries per part in ten cities for Q1 2017 is shown here.

The total forecast quantities for each part for Q1 2017

The total forecast quantities for each part for Q1 2017

B.3: The linear model of failures for each part

A linear model of the number of failures and time per part in each city is shown here, which can be filtered by location.

The linear model of failures for each part

The linear model of failures for each part

B.4: The forecast quantities for each part for Q1 2017 in ten cities

The total predicted number of deliveries for each part in each city is shown here.

The forecast quantities for each part for Q1 2017 in ten cities

The forecast quantities for each part for Q1 2017 in ten cities

D: The table showing the data needed for the visualization

The table showing the data needed for the visualization.

The table showing the data needed for the visualization

The table showing the data needed for the visualization

Summary

At this point, the entire case study was concluded, and in the process we analyzed all the tables and then extracted the important data. Then the failure of each part in the ten cities over the last three years was analyzed, a linear model was built and validated, and then the deliveries for the first quarter of 2017 were forecasted. In addition to this visualization using the Shiny App, users can view the data for each part, the forecasted situation, and display their data and geographic location through a map. During the analysis it was found that part ID_T01 had the highest number of failures and therefore its future deliveries should be sufficient.